12/10/2023

# Systolic Array Multiplier with VLS



Author: Jayanta Chowdhury

Tutor: Prof. Dr. Salim Ullah

Course: EHSD 2023

## **Table of Contents**

| Section | Title                       |
|---------|-----------------------------|
| 1.      | Introduction                |
| 2.      | Architectural Design        |
| 2.1.    | Systolic Array Organization |
| 2.2.    | Proposed Design             |
| 3.      | Performance Evaluation      |
| 3.1.    | Solution 1                  |
| 3.2.    | Solution 2                  |
| 3.3.    | Solution 3                  |
| 4.      | Results                     |
| 5.      | Conclusions                 |
|         | APPENDIX                    |

**List of Figures**

- Figure 1 Matrix Multiplication
- Figure 2 Systolic array initial scenario
- Figure 3 Systolic array steps
- Figure 4 Complete design block diagram
- Figure 5 Interface view
- Figure 6 Solution\_1 loop performance
- Figure 7 Solution\_2 loop performance
- Figure 8 Solution\_3 loop performance

**List of Tables**

- Table 1 Criteria performance evaluation
- Table 2 Detailed Resource utilization

## 1. Introduction

Modern industry needs efficient and fast multipliers for a wide range of applications. There are various approaches to designing a multiplier; one promising option is the systolic array-based multiplier, which reuses data well and is faster than a general-purpose array multiplier. This report discusses the design, its challenges, and its performance on hardware. Array multipliers can operate on floating-point as well as integer data; here we focus on the design of a 64-bit integer multiplier. The code base for both Vitis HLS and Vivado can be found in the repository: https://github.com/jayanta1996/Systolic-array-multiplication

## 2. Architectural Design

The multiplier is designed in the Vitis HLS IDE from Xilinx. Vitis HLS provides high-level language support for designing and verifying modules. It also supports directives called pragmas, which instruct the compiler how to synthesize kernel source code into RTL. The Export RTL feature packages the RTL as an IP block that can be imported into the Vivado IDE library for further design work. The hardware used for the experiment is an FPGA board, the Ultra96-V2, carrying a Xilinx chip with part number xczu3eg-sbva484-1-i. The imported IP block is connected to the Zynq ARM core so that data can be sent and received easily. The interface chosen for communication with the ARM core is AXI-Stream. For testing, we used a simple Python script that runs in a Jupyter notebook directly on the ARM core of the FPGA chip. The IP and its associated circuitry reside in the configurable logic area of the FPGA, whereas the ARM core handles communication with the external environment.

First we discuss the systolic architecture; then we explore different design options for the multiplier, using pragmas to find the best way to implement the IP on the FPGA board.

#### 2.1. Systolic Array Organization

A systolic array works by flowing data from memory through an array of connected compute units, thereby reusing the data read from memory. In our case, a matrix multiply operation, the compute units are MAC units. Because data read from memory is reused, this organization is faster and more efficient than one that fetches every operand from memory. *Figure 1* illustrates general matrix multiplication.



Figure 1 Matrix Multiplication
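As a software reference for the operation in Figure 1, the following minimal sketch multiplies two matrices with the textbook triple loop. It is not taken from the project repository; the function name `matmul_reference` and the list-of-rows representation are assumptions made for illustration.

```python
def matmul_reference(a, b):
    """Multiply a (n x k) by b (k x m); matrices are lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0] * m for _ in range(n)]
    for i in range(n):          # row of A
        for j in range(m):      # column of B
            for p in range(k):  # one multiply-accumulate per step
                c[i][j] += a[i][p] * b[p][j]
    return c
```

Every product term reads both operands again, which is the memory traffic a systolic array is meant to reduce.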

A first look at the systolic architecture in *Figure 2* shows four compute units interconnected as a systolic array. Each compute unit takes two values, one from the top and one from the left, multiplies them, and stores the result in a built-in accumulator register. Note that each compute unit operates on its input values only when both are available. The data fed to the unit is then passed on to the next unit.



Figure 2 Systolic array initial scenario

The data flow is indicated by the colored arrows. On each unit the orange data is multiplied by the blue data; the orange data then flows downwards, while the blue data flows rightwards. The result of each multiplication is accumulated in the unit's local memory, as shown in *Figure 3*. This process is repeated until all data has been fed into the array, and it is symmetrical: the same operation occurs on every compute unit in an identical manner, which makes scaling the number of units conceptually very simple. Larger arrays reuse data more and reduce accesses to memory.



Figure 3 Systolic array steps

Finally, the data flowing out of the array can be fed into another array or simply discarded.
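The data movement described above can be modelled cycle by cycle in software. The sketch below is a behavioural model only, not the HLS source. It assumes that rows of A enter from the left and columns of B enter from the top, each delayed by one cycle per row or column so that matching operands meet in the right unit. The names `systolic_matmul`, `a_reg` and `b_reg` are hypothetical.

```python
def systolic_matmul(a, b):
    """Cycle-level model of an n x m array of accumulating compute units."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]         # accumulators start at zero
    a_reg = [[None] * m for _ in range(n)]    # value moving rightwards
    b_reg = [[None] * m for _ in range(n)]    # value moving downwards
    for t in range(n + m + k - 2):            # cycles until the last unit finishes
        new_a = [[None] * m for _ in range(n)]
        new_b = [[None] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:                    # left edge: skewed row i of A
                    p = t - i
                    new_a[i][j] = a[i][p] if 0 <= p < k else None
                else:                         # take the value from the left neighbour
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:                    # top edge: skewed column j of B
                    p = t - j
                    new_b[i][j] = b[p][j] if 0 <= p < k else None
                else:                         # take the value from the upper neighbour
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(m):
                # A unit only computes when both inputs are available.
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

In this model each input value is read from the source matrix once, at the edge of the array, and is then passed from unit to unit.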

#### 2.2. Proposed Design

The main compute unit, or processing element, of the architecture is the MAC unit. Unlike a traditional two-input MAC, our MAC unit takes three inputs, named "last", "aval" and "bval". "aval" and "bval" are the orange and blue values fed from the systolic architecture as shown above, whereas "last" receives the previously accumulated value. It is important to initialize the registers to 0 to avoid garbage values. The unit produces an output that is stored back in the local registers. The unit is instantiated multiple times to perform the complete array multiplication. Vitis HLS automatically creates the flip-flops required to control the data flow described earlier.
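A minimal behavioural sketch of such a three-input MAC is given below. The input names `last`, `aval` and `bval` come from the text. The unsigned 64-bit wrap-around and the helper `dot` are assumptions made for illustration, since the report does not state the signedness or overflow behaviour of the accumulator.

```python
MASK64 = (1 << 64) - 1  # assumption: unsigned 64-bit accumulator that wraps on overflow

def mac(last, aval, bval):
    """Return the new accumulator value: last + aval * bval, wrapped to 64 bits."""
    return (last + aval * bval) & MASK64

def dot(avals, bvals):
    """Feed a sequence of operand pairs through one MAC unit."""
    acc = 0  # the register is initialised to zero to avoid garbage values
    for aval, bval in zip(avals, bvals):
        acc = mac(acc, aval, bval)
    return acc
```

For example, `dot([1, 2, 3], [4, 5, 6])` accumulates 4, then 14, then 32.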

The inputs and outputs are provided to the IP through the AXI-Stream protocol with the help of the built-in Zynq MPSoC and an AXI DMA. The DMA acts as an interface between the memory of the processing system (PS) and the AXI stream, for both sending and receiving data. Using the built-in AXI template generates the buffers required for a full communication frame. The stream input is first stored in local matrices, from which the data is fetched and fed to the systolic architecture. Similarly, the output is stored in a local matrix, which is then converted back to the stream protocol.
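The conversion between a stream and the local matrices can be pictured with the small model below. It is a sketch under assumptions: the report does not state the element order on the stream, so row-major order is assumed, and the function names are hypothetical. The boolean in each output pair models the AXI-Stream TLAST signal, which marks the final beat of a frame.

```python
def stream_to_matrix(beats, rows, cols):
    """Store a flat sequence of stream beats in a local rows x cols matrix."""
    beats = list(beats)
    if len(beats) != rows * cols:
        raise ValueError("stream length does not match matrix size")
    # Assumption: elements arrive in row-major order.
    return [beats[r * cols:(r + 1) * cols] for r in range(rows)]

def matrix_to_stream(matrix):
    """Flatten a local matrix into (data, last) pairs; last is True on the final beat."""
    flat = [value for row in matrix for value in row]
    return [(value, index == len(flat) - 1) for index, value in enumerate(flat)]
```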

Various pragmas were tested and different solutions were generated for comparison. The performance analysis is discussed in the next section.



Figure 4 Complete design block diagram



Figure 5 Interface view

Figure 4 and Figure 5 show the design as a complete block diagram and as an interface view. A, B and C are the matrices explained above. Local memory access and the local FIFOs are managed by the DMA. AXI SmartConnect (SMC) blocks are added to the design automatically to avoid packet loss or clock gating issues during data transfer. sys\_matmult\_0 is our IP generated through Vitis HLS. The IP itself has AXI-Stream interfaces for both input and output.

## 3. Performance Evaluation

In this section we discuss the performance of our systolic array IP module. For most of the analysis we used the Vitis HLS platform and its built-in analysis tool. We experimented with different pragmas to create multiple solutions that differ in resource utilization, latency and estimated timing. This allows us to choose the best solution and export it as an IP so that its functionality can be further verified in the Vivado IDE.

#### 3.1. Solution 1

In this variant no pragmas are used. The input matrices are instantiated in local multiport ROM and the output matrix in a local dual-port RAM. The hotspots of the code are the loops and the systolic architecture; these are what need to be optimized to get the best performance.



Figure 6 Solution\_1 loop performance

Figure 6 also shows that the input, output and systolic loops are all flattened and pipelined. The synthesis tool takes this action automatically, based on the structure of the code. The systolic loop received a resource-limitation violation. Further analysis showed the cause to be a memory read dependency: the memory does not have enough ports. The performance criteria are discussed in detail in the Results section.

#### 3.2. Solution 2

In this variant we tried to address the shortcomings of Solution 1 and improve the design. The pragma <u>HLS ARRAY\_PARTITION</u> is applied to all input and output matrices. Input matrix A is partitioned along dimension 1 (row-wise) and input matrix B along dimension 2 (column-wise), whereas output matrix C is partitioned into individual elements using dimension 0. As a result, the data no longer needs to be stored in local RAM or ROM; it is stored in LUTs, which can be accessed without memory read/write delay. We also added the pipeline pragma <u>HLS PIPELINE</u> to force pipelining under a constraint and see whether it brings any improvement.
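Why partitioning removes the port bottleneck seen in Solution 1 can be illustrated with a simple counting model. The sketch below is not a Vitis HLS API. `read_cycles` is a hypothetical helper, and it assumes that the reads needed in one loop iteration are spread evenly over the available banks.

```python
import math

def read_cycles(num_reads, ports_per_bank, banks=1):
    """Cycles needed to issue num_reads reads when each bank offers ports_per_bank ports."""
    if num_reads < 0 or ports_per_bank < 1 or banks < 1:
        raise ValueError("invalid arguments")
    # Assumption: reads are distributed evenly across banks.
    return math.ceil(num_reads / (ports_per_bank * banks))
```

With hypothetical numbers, four reads from a single dual-port memory take `read_cycles(4, 2)` = 2 cycles, while the same reads from an array split into four banks take `read_cycles(4, 2, banks=4)` = 1 cycle.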



Figure 7 Solution\_2 loop performance

We observe a pipeline initiation interval violation: the tool cannot achieve the requested constraint. The initiation interval is the number of clock cycles that must elapse between the start of two consecutive loop iterations. The achieved initiation interval is 34 instead of 1, which leaves clear room for improvement.
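The impact of the initiation interval can be estimated with the usual approximation for a pipelined loop: latency ≈ pipeline depth + II × (trip count − 1). The sketch below applies it. Only the II values 34 and 1 come from the text; the trip count of 16 and the depth of 4 used in the example are hypothetical, because the report does not list them.

```python
def pipelined_loop_latency(trip_count, ii, depth):
    """Approximate latency in cycles of a pipelined loop.

    trip_count: number of loop iterations
    ii:         initiation interval (cycles between starts of consecutive iterations)
    depth:      cycles needed by a single iteration
    """
    if trip_count < 1 or ii < 1 or depth < 1:
        raise ValueError("all arguments must be positive")
    return depth + ii * (trip_count - 1)

# Hypothetical example values, not measurements from the report.
latency_ii_1 = pipelined_loop_latency(trip_count=16, ii=1, depth=4)
latency_ii_34 = pipelined_loop_latency(trip_count=16, ii=34, depth=4)
```

For these example values the estimate grows from 19 cycles at II = 1 to 514 cycles at II = 34, which shows why the II violation dominates the loop latency.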

#### 3.3. Solution 3

This variant adds the pragma <u>HLS LOOP\_TRIPCOUNT</u> to every loop, which tells the tool the minimum and maximum number of iterations so that it can report the minimum and maximum latency. The pragma is used for analysis and reporting only. The pragma <u>HLS UNROLL</u> is used within the systolic architecture loops; it creates multiple copies of the loop body so that iterations can run in parallel where possible. The earlier pragmas are removed, except for array partition.



Figure 8 Solution\_3 loop performance

All input and output matrix loops are flattened and pipelined. The violation remains, but the initiation interval is reduced to 5, which is a significant improvement. Note: <u>HLS UNROLL</u> and <u>HLS PIPELINE</u> are complementary to each other.

## 4. Results

In this section we discuss the performance results in terms of latency, estimated timing and resource utilization, as summarized in the tables below.

| Solution Name | Latency (cycles) | Estimated timing achieved (ns) | Resource utilization (%) |
|---------------|------------------|--------------------------------|--------------------------|
| Solution 1    | 142              | 6.919                          | 2.2                      |
| Solution 2    | 211              | 7.207                          | 5.1                      |
| Solution 3    | 94               | 7.207                          | 7.1                      |

Table 1 Criteria performance evaluation

| Solution Name | BRAM (avail. 432) | DSP (avail. 360) | FF (avail. 141120) | LUT (avail. 70560) |
|---------------|-----------------|----------------|------------------|------------------|
| Solution 1    | 8               | 14             | 2231             | 2421             |
| Solution 2    | 0               | 14             | 6178             | 4667             |
| Solution 3    | 0               | 56             | 8769             | 6264             |

Table 2 Detailed Resource utilization

Table 1 and Table 2 show the IP performance against the chosen criteria and the detailed utilization of hardware resources. Solution 1 looks attractive because it has the lowest resource utilization and the shortest estimated clock period, but it uses on-chip BRAM, which is usually not good practice, and its latency is high. Solution 2 and Solution 3 achieve the same timing but differ in resource utilization and latency. Solution 2 uses fewer resources but has higher latency. Solution 3 achieves lower latency at the cost of more DSP elements, which are themselves an expensive resource on the chip. Hence the best solution depends on the use case. All solutions were exported as IPs and tested with the Zynq MPSoC, and the results obtained were correct.
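The figures in Table 1 and Table 2 can be combined into two derived quantities: the utilization of each resource type as a percentage of what is available, and an approximate latency in nanoseconds. The sketch below is illustrative. The numbers are copied from the tables, while the function names are hypothetical, and the latency estimate assumes the design is clocked at exactly the estimated period, which the report does not state.

```python
AVAILABLE = {"BRAM": 432, "DSP": 360, "FF": 141120, "LUT": 70560}

USED = {  # values from Table 2
    "Solution 1": {"BRAM": 8, "DSP": 14, "FF": 2231, "LUT": 2421},
    "Solution 2": {"BRAM": 0, "DSP": 14, "FF": 6178, "LUT": 4667},
    "Solution 3": {"BRAM": 0, "DSP": 56, "FF": 8769, "LUT": 6264},
}

TIMING = {  # (latency in cycles, estimated clock period in ns) from Table 1
    "Solution 1": (142, 6.919),
    "Solution 2": (211, 7.207),
    "Solution 3": (94, 7.207),
}

def utilization_percent(solution):
    """Per-resource utilization of one solution, in percent of the available count."""
    used = USED[solution]
    return {name: 100.0 * used[name] / AVAILABLE[name] for name in AVAILABLE}

def latency_ns(solution):
    """Latency in ns, assuming the clock runs at the estimated period."""
    cycles, period_ns = TIMING[solution]
    return cycles * period_ns
```

Under that assumption Solution 3 comes out at roughly 677 ns, against roughly 982 ns for Solution 1 and roughly 1521 ns for Solution 2, while its DSP utilization rises to about 15.6 % of the available slices.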

## 5. Conclusions

In this project we target matrix multiplication with a systolic array architecture, which is widely used in a variety of applications. The design is vendor-oriented, since its acceleration relies mostly on DSPs, which are vendor-dependent. AXI-Stream was chosen as the communication protocol; it is commonly used to receive requests from the main processor and write data into memory. The design can be configured for any number of rows and columns of the input matrices, but because it is written in Vitis HLS, the RTL must be imported into Vivado before it can be used in an application.

Although the current design can be improved further, we achieved our goal of keeping the granularity at the PE level. The design can be extended to work with tiling, currently a hot research topic, which opens the door to large matrix multiplications with very few resources.

## APPENDIX

- FPGA: Field Programmable Gate Array
- ARM: Advanced RISC Machine
- IP: Intellectual Property
- MPSoC: Multiprocessor System on Chip
- AXI: Advanced Extensible Interface
- DMA: Direct Memory Access
- SMC: SmartConnect
- RTL: Register Transfer Level
- HDL: Hardware Description Language
- RAM: Random Access Memory
- ROM: Read-Only Memory
- LUT: Look-Up Table
- FIFO: First In, First Out
- MAC: Multiply Accumulate
- HLS: High Level Synthesis
- DSP: Digital Signal Processor
- PE: Processing Element
- HW: Hardware
- IDE: Integrated Development Environment